Parallel corpus alignment at the document, sentence and vocabulary levels
نویسنده
چکیده
This paper presents a language independent algorithm for the alignment of parallel corpora at the document, sentence and vocabulary levels using the to-be aligned corpus itself as the only source of information. The input is a set of documents written in two unknown languages A and B, where every document in language A has its corresponding translation into language B. The problem thus consists of: 1) dividing the set of documents in the two languages; 2) aligning at the document level to determine which document in language A is the original (or translation) of each document in language B; 3) aligning at the sentence level to determine which sentence in the original corresponds to each sentence in the translation and 4) aligning at the vocabulary level to determine which word in one language is equivalent to each word in the translation. The algorithm is iterative, using the resulting bilingual vocabulary to re-align the corpus. Evaluation figures in English, Spanish and French show competitive results at all levels of the alignment.
منابع مشابه
Contribution to Terminology Internationalization by Word Alignment in Parallel Corpora
BACKGROUND AND OBJECTIVES Creating a complete translation of a large vocabulary is a time-consuming task, which requires skilled and knowledgeable medical translators. Our goal is to examine to which extent such a task can be alleviated by a specific natural language processing technique, word alignment in parallel corpora. We experiment with translation from English to French. METHODS Build ...
متن کاملDCEP -Digital Corpus of the European Parliament
We are presenting a new highly multilingual document-aligned parallel corpus called DCEP Digital Corpus of the European Parliament. It consists of various document types covering a wide range of subject domains. With a total of 1.37 billion words in 23 languages (253 language pairs), gathered in the course of ten years, this is the largest single release of documents by a European Union institu...
متن کاملLarge - Scale Automatic Extraction of anEnglish - Chinese Translation
We report experimental results on automatic extraction of an English-Chinese translation lexicon, by statistical analysis of a large parallel corpus, using limited amounts of linguistic knowledge. To our knowledge, these are the rst empirical results of the kind between an Indo-Europeanand non-Indo-Europeanlanguage for any signiicantvocabulary and corpus size. The learned vocabulary size is abo...
متن کاملSentence Alignment in Parallel, Comparable, and Quasi-comparable Corpora
We explore the usability of different bilingual corpora for the purpose of multilingual and cross-lingual natural language processing. The usability of bilingual corpus is evaluated by the lexical alignment score calculated for the bi-lexicon pair distributed in the aligned bilingual sentence pairs. We compare and contrast a number of bilingual corpora, ranging from parallel, to comparable, and...
متن کاملDeveloping Bilingual Plagiarism Detection Corpus Using Sentence Aligned Parallel Corpus: Notebook for PAN at CLEF 2015
Plagiarism detection is the process of locating text reuse within a suspicious document. The plagiarism detection corpora are used for evaluating plagiarism detection systems. In this paper, we present a bilingual PersianEnglish plagiarism detection corpus. We provide our corpus for the task of text alignment corpus construction in the PAN 2015 competition. Our approach is based on parallel cor...
متن کاملذخیره در منابع من
با ذخیره ی این منبع در منابع من، دسترسی به آن را برای استفاده های بعدی آسان تر کنید
عنوان ژورنال:
- Procesamiento del Lenguaje Natural
دوره 47 شماره
صفحات -
تاریخ انتشار 2011